Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach

The focus of modern biomedical studies has gradually shifted to explanation and estimation of the joint effects of high dimensional predictors on disease risks. Quantifying uncertainty in these estimates may provide valuable insight into prevention strategies or treatment decisions for both patients and physicians. High dimensional inference, including confidence intervals and hypothesis testing, has sparked much interest. While much work has been done in the linear regression setting, there is a lack of literature on inference for high dimensional generalized linear models. We propose a novel and computationally feasible method, which accommodates a variety of outcome types, including normal, binomial, and Poisson data. We use a "splitting and smoothing" approach, which splits samples into two parts, performs variable selection using one part, and conducts partial regression with the other part. Averaging the estimates over multiple random splits, we obtain the smoothed estimates, which are numerically stable. We show that the estimates are consistent and asymptotically normal, and we construct confidence intervals with proper coverage probabilities for all predictors. We examine the finite sample performance of our method by comparing it with existing methods and by applying it to analyze a lung cancer cohort study.


Introduction
In the big data era, high dimensional regression has been widely used to address questions arising from many scientific fields, ranging from genomics to sociology (Hastie et al., 2009;Fan and Lv, 2010). For example, modern biomedical research has gradually shifted to understanding joint effects of high dimensional predictors on disease outcomes (e.g. molecular biomarkers on the onset of lung cancer) (Vaske et al., 2010;Chen and Yan, 2014, among others). A motivating clinical study is the Boston Lung Cancer Survivor Cohort (BLCSC), one of the largest comprehensive lung cancer survivor cohorts, which investigates the molecular mechanisms underlying lung cancer (Christiani, 2017). Using a target gene approach (Moon et al., 2003;Garrigos et al., 2018;Ho et al., 2019), we analyzed a subset of 708 lung cancer patients and 751 controls, with 6,800 single nucleotide polymorphisms (SNPs) from 15 cancer related genes, in addition to demographic variables such as age, gender, race, education level, and smoking status. Our objective was to determine which covariates were predictive in distinguishing cases from controls. As smoking is known to play a significant role in the development of lung cancer, we were interested in estimating and testing the interaction between smoking status (never versus ever smoked) and each SNP, in addition to the main effect of the SNP. Quantifying uncertainty of the estimated effects helps inform prevention strategies or treatment decisions for patients and physicians (Minnier et al., 2011).
Considerable progress has been made in drawing inferences based on penalized linear models (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Bühlmann et al., 2014; Dezeure et al., 2015). While techniques for variable selection and estimation in high dimensional settings have been extended to generalized linear models (GLMs) and beyond (Van de Geer, 2008; Fan et al., 2009; Witten and Tibshirani, 2009), high dimensional inference in these settings is still in its infancy. For example, Bühlmann et al. (2014) generalized the de-sparsified LASSO to high dimensional GLMs, while Ning and Liu (2017) proposed a de-correlated score test for penalized M-estimators. In the presence of high dimensional control variables, Belloni et al. (2014, 2016) proposed a post-double selection procedure for estimation and inference of a single treatment effect, and Lee et al. (2016) characterized the distribution of a post-LASSO-selection estimator conditional on the selected variables, but only for linear regression.
However, the performance of these methods may depend heavily on tuning parameters, often chosen by computationally intensive cross-validation. Also, these methods may require inverting a p × p information matrix (where p is the number of predictors), or equivalently, estimating a p × p precision matrix, with extensive computation and stringent technical conditions. For example, the sparse precision matrix assumption may be violated in GLMs, resulting in biased estimates (Xia et al., 2020).
We propose a new approach for drawing inference with high dimensional GLMs. The idea is to randomly split the samples into two sub-samples (Meinshausen et al., 2009), use the first sub-sample to select a subset of important predictors and achieve dimension reduction, and use the remaining samples to fit, in parallel, low dimensional GLMs by appending each predictor to the selected set, one at a time, to obtain an estimated coefficient for every predictor, regardless of whether it was selected. As with other methods for high dimensional regression (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Bühlmann et al., 2014), one key assumption is that the number of non-zero components of β*, the true value of the parameter vector β in the regression model, is small relative to the sample size. The sparsity condition is reasonable in some biomedical applications. For example, in the context of cancer genomics, it is likely that a certain type of cancer is related to only a handful of oncogenes and tumor suppressor genes (Lee and Muller, 2010; Goossens et al., 2015). Under this sparsity condition, we show that our proposed estimates are consistent and asymptotically normal. However, these estimates can be highly variable due to both the random splitting of the data and the variation incurred through selection. To stabilize the estimation and account for the variation induced by variable selection, we repeat the random splitting a number of times and average the resulting estimates to obtain the smoothed estimates. These smoothed estimators are consistent and asymptotically normal, with improved efficiency.
Our approach, termed Splitting and Smoothing for GLM (SSGLM), aligns with multisample splitting (Meinshausen et al., 2009;Wang et al., 2020) and bagging (Bühlmann and Yu, 2002;Friedman and Hall, 2007;Efron, 2014), and differs from the existing methods based on penalized regression (Zhang and Zhang, 2014;Bühlmann et al., 2014;Ning and Liu, 2017;Javanmard and Montanari, 2018). The procedure has several novelties. First, it addresses the high dimensional estimation problem through the aggregation of low dimensional estimations and presents computational advantages over other existing methods. For example, de-biased methods (Bühlmann et al., 2014;Javanmard and Montanari, 2018) require well-estimated high dimensional precision matrices for proper inference (e.g. correct coverage probabilities), which is statistically and computationally challenging. Complicated procedures that involve choosing a large number of tuning parameters are needed to strike a balance between estimation accuracy and model complexity; see Bühlmann et al. (2014) and Javanmard and Montanari (2014). In contrast, our algorithm is more straightforward as it avoids estimating a high dimensional precision matrix by adopting a "split and select" strategy with minimal tuning. Second, we have derived the variance estimator using the infinitesimal jackknife method adapted to the splitting and smoothing procedure (Efron, 2014). This is free of parametric assumptions and leads to confidence intervals with correct coverage probabilities. Third, we have relaxed the stringent "selection consistency" assumption on variable selection, which is required in Fei et al. (2019). Our procedure is valid with a mild "sure screening" assumption for the selection method. Finally, our framework facilitates hypothesis testing and drawing inference on predetermined contrasts in the presence of high dimensional nuisance parameters.
The rest of the paper is organized as follows. Section 2 describes the SSGLM procedure and Section 3 introduces its theoretical properties. Section 4 describes the inferential procedure and Section 5 extends it to accommodate any sub-vectors of parameters of interest. Section 6 provides simulations and comparisons with the existing methods. Section 7 reports our analysis of the BLCSC data. We conclude the paper with a brief discussion in Section 8.

Model Setup
Suppose the outcome Y, given the predictors x = (x_1, …, x_p), follows a distribution in the linear exponential distribution family, which includes the normal, Bernoulli, Poisson, and negative-binomial distributions. That is, given x, the conditional density function for Y is
f(y | x) = exp{yθ − A(θ) + c(y)},   (1)
where A(·) is a specified function that links the mean of Y to x through θ. We assume the second derivative of A(θ) is continuous and positive. We consider the canonical mean parameter θ = x̃β, where x̃ = (1, x) and β = (β_0, β_1, …, β_p)^T includes an intercept term. Specifically, denote μ = E(Y | x) = A′(θ) = g^{−1}(x̃β) and V(Y | x) = A″(θ) = ν(μ), where g(·) and ν(·) are the link and variance functions, respectively.
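As a point of reference (these are standard GLM facts rather than results specific to this paper), the familiar members of this family correspond to the following choices of A(θ), link g(μ), and variance function ν(μ):

```latex
% Standard members of the linear exponential family (for reference only):
% canonical A(theta), link g(mu), and variance function nu(mu).
\[
\begin{array}{llll}
\text{Normal:}    & A(\theta)=\tfrac{1}{2}\theta^{2}, & g(\mu)=\mu,                 & \nu(\mu)=1,\\[2pt]
\text{Bernoulli:} & A(\theta)=\log(1+e^{\theta}),     & g(\mu)=\log\{\mu/(1-\mu)\}, & \nu(\mu)=\mu(1-\mu),\\[2pt]
\text{Poisson:}   & A(\theta)=e^{\theta},             & g(\mu)=\log\mu,             & \nu(\mu)=\mu.
\end{array}
\]
```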
For an index set S ⊂ {1, 2, …, p}, we add S to the subscripts of vectors and matrices to index sub-vectors x_iS = (x_ij)_{j∈S} and x̃_iS = (1, x_iS), and sub-matrices X_S = (X_j)_{j∈S} and X̃_S = (1, X_S).
We write β_S = (β_0, (β_j)_{j∈S}), which always includes the intercept and is of length 1 + |S|. The negative log-likelihood for model (1) that regresses Y on X_S (termed the partial regression) is
ℓ_S(β_S) = n^{−1} Σ_{i=1}^n {A(x̃_iS β_S) − Y_i x̃_iS β_S}.
Similarly, the score and observed information are U_S(β_S) = n^{−1} X̃_S^T {A′(X̃_S β_S) − Y} and I_S(β_S) = n^{−1} X̃_S^T V_S X̃_S, where V_S is the diagonal matrix with entries A″(x̃_iS β_S). We define the "observed" sub-information by Î_S = I_S(β_S^*) and the "expected" sub-information by I_S = E(Î_S). The latter is equal to the sub-matrix of I* with rows and columns indexed by S, which is denoted by I_S^*.
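To make these quantities concrete, the following R sketch computes ℓ_S, U_S(β_S), and I_S(β_S) for a logistic (Bernoulli) partial regression; the inputs X, y, S, and beta_S, as well as the function name, are hypothetical placeholders rather than objects from the paper.

```r
## Sketch: negative log-likelihood, score U_S, and observed information I_S for a
## logistic partial regression; X, y, S, and beta_S are assumed (placeholder) inputs.
partial_glm_quantities <- function(X, y, S, beta_S) {
  n    <- length(y)
  XtS  <- cbind(1, X[, S, drop = FALSE])     # design with intercept (x-tilde_S)
  eta  <- drop(XtS %*% beta_S)               # canonical parameter theta
  A    <- log1p(exp(eta))                    # A(theta) for the Bernoulli family
  mu   <- plogis(eta)                        # A'(theta)
  v    <- mu * (1 - mu)                      # A''(theta) = nu(mu)
  list(
    nll   = mean(A - y * eta),               # l_S(beta_S)
    score = crossprod(XtS, mu - y) / n,      # U_S(beta_S)
    info  = crossprod(XtS, v * XtS) / n      # I_S(beta_S)
  )
}
```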

Proposed SSGLM estimator
Under model (1), we assume a sparsity condition that s_0 = |S*|, the size of the active set S* = {j : β_j^* ≠ 0}, is small relative to the sample size; this is detailed in Section 3. We randomly split the samples D^(n) into two parts, D_1 and D_2, with sample sizes |D_1| = n_1 and |D_2| = n_2, respectively, such that n_1 + n_2 = n. For example, we can consider an equal split with n_1 = n_2 = n/2. We apply a variable selection scheme S_λ, where λ denotes the tuning parameters, to D_2 to select a subset of important predictors S ⊂ {1, …, p}, with |S| < n, for dimension reduction. Then, using D_1 = (Y^1, X^1), for each j = 1, 2, …, p, we fit a low dimensional GLM by regressing Y^1 on X̃^1_{S_{+j}}, where S_{+j} = {j} ∪ S. Denote the maximum likelihood estimate (MLE) of each fitted model by β̃_{S_{+j}}, and define β̃_j = (β̃_{S_{+j}})_j, the element of β̃_{S_{+j}} corresponding to predictor X_j. We denote by β̃_0 the estimate of the intercept from regressing Y^1 on X̃^1_S. Thus, the one-time estimator based on a single data split is defined as
β̃_{S_{+j}} = argmin_{β_{S_{+j}}} ℓ_{S_{+j}}(β_{S_{+j}}; Y^1, X̃^1_{S_{+j}}),  β̃_j = (β̃_{S_{+j}})_j,  β̃ = (β̃_0, β̃_1, …, β̃_p).   (3)
The rationale for this one-time estimator is that if the subset of important predictors, S, is equal to or contains the active set S*, then β̃_j is a consistent estimator of β_j^* regardless of whether variable j is selected (Fei et al., 2019). We show in Theorem 1 that the one-time estimator is indeed consistent and asymptotically normal in the GLM setting.
However, the estimator based on a single split is highly variable, making it difficult to separate true signals from noise. This phenomenon is analogous to using a single tree in the bagging algorithm (Bühlmann and Yu, 2002). To reduce this variability, we resort to a multi-sample splitting scheme: we randomly split the data multiple times, repeat the estimation procedure, and average the resulting estimates to obtain the smoothed coefficient estimates. Specifically, for each b = 1, 2, …, B, where B is large, we randomly split the data D^(n) into D_1^b and D_2^b, with |D_1^b| = n_1 and |D_2^b| = n_2, such that the splitting proportion is q = n_1/n, 0 < q < 1. Denote the candidate set of variables selected by applying S_λ to D_2^b by S^b, and the estimates obtained via (3) by β̃^b = (β̃_0^b, β̃_1^b, …, β̃_p^b). Then the smoothed estimator, termed the SSGLM estimator, is β̂ = (β̂_0, β̂_1, …, β̂_p), where
β̂_j = B^{−1} Σ_{b=1}^B β̃_j^b.   (4)
The procedure is described in Algorithm 1.
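To fix ideas, below is a minimal R sketch of Algorithm 1 in which the selection step S_λ is a cross-validated LASSO (one convenient choice; the paper also uses SIS and other selectors). The function name ssglm_fit, the defaults, and the use of cv.glmnet are our illustrative choices, not the authors' implementation.

```r
## Minimal sketch of the SSGLM point estimator (Algorithm 1).  The selection step
## here is a cross-validated LASSO via glmnet; names and defaults are illustrative.
library(glmnet)

ssglm_fit <- function(X, y, family = "binomial", B = 100, q = 0.5) {
  n <- nrow(X); p <- ncol(X)
  beta_mat <- matrix(NA_real_, B, p)            # tilde-beta_j^b for each split b
  J <- matrix(0L, B, n)                         # J_bi = 1 if observation i is in D_1^b
  for (b in seq_len(B)) {
    idx1 <- sample(n, floor(q * n))             # D_1^b: estimation sub-sample
    idx2 <- setdiff(seq_len(n), idx1)           # D_2^b: selection sub-sample
    J[b, idx1] <- 1L
    cvfit <- cv.glmnet(X[idx2, ], y[idx2], family = family)
    S <- which(as.vector(coef(cvfit, s = "lambda.min"))[-1] != 0)
    for (j in seq_len(p)) {
      Sj  <- union(S, j)                        # S_{+j} = {j} union S
      fit <- glm(y[idx1] ~ X[idx1, Sj, drop = FALSE], family = family)
      beta_mat[b, j] <- coef(fit)[1 + match(j, Sj)]   # coefficient of predictor j
    }
  }
  list(beta = colMeans(beta_mat),               # smoothed estimates hat-beta_j, eq. (4)
       beta_mat = beta_mat, J = J, n1 = floor(q * n))
}
```

The per-split estimates beta_mat and the inclusion indicators J are retained because the variance estimator of Section 4 is computed from them.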

Theoretical Results
We specify the following regularity conditions.
(A2) (Bounded eigenvalues and effects) The eigenvalues of Σ = E(x̃^T x̃), where x̃ = (1, x), are bounded below and above by constants c_min and c_max, such that 0 < c_min ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ c_max < ∞. In addition, ‖β*‖_∞ ≤ c_β for some constant c_β > 0.

(A3) (Sure screening) For a sample of size n, the selected set S = S_λ(D^(n)) satisfies, for constants K_1, K_2, c_2 > 0 and 0 ≤ c_1 < 1/2, |S| ≤ K_1 n^{c_1} and P(S ⊇ S*) ≥ 1 − K_2 (p ∨ n)^{−1−c_2}.

Variable selection methods that satisfy the sure screening property are available. For example, Assumptions (A1) and (A2), along with a "beta-min" condition stipulating that min_{j∈S*} |β_j^*| > c_0 n^{−κ} with c_0 > 0 and 0 < κ < 1/2, ensure that the commonly used sure independence screening (SIS) procedure (Fan and Song, 2010) satisfies the sure screening property; see Theorem 4 in Fan and Song (2010). While a "beta-min" condition is common in deriving the sure screening property, it is not required for de-biased type estimators. We take S_λ to be the SIS procedure when conducting variable selection in the simulations and the data analysis (a minimal sketch of such a screener is given after Theorem 1). Theorems 1 and 2 below correspond to the one-time estimator and the SSGLM estimator, respectively.

Theorem 1 Given model (1) and assumptions (A1)-(A3), consider the one-time estimator β̃ = (β̃_0, β̃_1, …, β̃_p)^T as defined in (3). Denote p_s = |S|, and let σ̃_j² denote the asymptotic variance of β̃_j. Then as n → ∞, (i) β̃_j is consistent for β_j^* for each j = 0, 1, …, p, and (ii) √n_1 (β̃_j − β_j^*)/σ̃_j converges in distribution to N(0, 1).
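For concreteness, a marginal (SIS-type) screener of the kind used in our numerical work can be sketched as follows; the ranking statistic (the absolute marginal z-statistic) and the default cutoff d = ⌊n/log n⌋ are common choices in the SIS literature, and the exact tuning in any given analysis may differ.

```r
## Sketch of a sure-independence-screening step for GLMs: rank predictors by the
## absolute statistic of their marginal fits and keep the top d (choices are ours).
sis_select <- function(X, y, family = "binomial",
                       d = floor(nrow(X) / log(nrow(X)))) {
  stat <- apply(X, 2, function(xj) {
    fit <- glm(y ~ xj, family = family)
    abs(summary(fit)$coefficients[2, 3])   # marginal z (or t) statistic
  })
  order(stat, decreasing = TRUE)[seq_len(min(d, ncol(X)))]
}
```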

Theorem 2 Given model (1), assumptions (A1)-(A3), and the partial orthogonality condition discussed below, consider the SSGLM estimator β̂ = (β̂_0, β̂_1, …, β̂_p) defined in (4). Then, as n, B → ∞, β̂_j is consistent for β_j^* and √n (β̂_j − β_j^*)/σ_j converges in distribution to N(0, 1), where σ_j² is defined in the proof.
The added partial orthogonality condition in Theorem 2 is a technical assumption needed for the validity of the theorem and has been commonly assumed in the high dimensional literature (Fan and Lv, 2008; Fan and Song, 2010; Wang and Wang, 2014). However, our numerical experiments suggest that our results are robust to violations of this condition. In addition, while both the one-time estimator β̃_j and the SSGLM estimator β̂_j are consistent and asymptotically normal, the key advantage of β̂_j over β̃_j lies in its efficiency.
An immediate observation is that β̂_j is estimated using all n samples, whereas β̃_j is estimated with only n_1 samples, which explains the different normalization constants in their respective asymptotic variances, σ_j²/n and σ̃_j²/n_1. In addition, because σ̃_j² depends solely on a one-time variable selection S, its variability is high given the wide variability of S. On the other hand, σ_j² implicitly averages over the multiple selections S^b and gains efficiency via "the effect of bagging" (Bühlmann and Yu, 2002); see also Web Table 1.

We defer the proofs to the Appendix but provide some intuition here. The randomness of the selection S_λ presents difficulties when developing the theoretical properties. The reason sure screening suffices is that, given any subset S ⊇ S*, the estimator β̃_S is consistent, though less efficient (owing to the additional noise variables) than the "oracle estimator" β̃_{S*} acting upon the true active set. The proof also shows that σ_j² depends on the unknown S* and accounts for the variation across the B random splits. Therefore, direct computation of σ_j² in an analytical form is not feasible. Instead, we estimate the variance via the infinitesimal jackknife method (Efron, 2014; Fei et al., 2019).

Variance Estimator and Inference by SSGLM
The infinitesimal jackknife method has been applied to estimate the variance of a bagged estimator with bootstrap-type resampling (sampling with replacement) (Efron, 2014; Fei et al., 2019). The idea is to treat each β̃_j^b as a function of the sub-sample D_1^b, or of its empirical distribution represented by the sampling indicator vector J_b = (J_b1, J_b2, …, J_bn), where J_bi ∈ {0, 1} indicates whether the i-th observation is included in D_1^b. We further write β̃_j^b = t(J_b), where t(·) is a general function that maps the data (through the sampled indices) to the estimator; the expectation E_* and the convergences below are with respect to the probability measure induced by the randomness of the J_b's. We generalize the infinitesimal jackknife to estimate the variance Var(β̂_j), analogous to its bootstrap counterpart, by
V_j = {n(n − 1)/(n − n_1)²} Σ_{i=1}^n Ĉov_i²,   (5)
where Ĉov_i is the empirical covariance, over the B splits, between the estimates β̃_j^b and the sampling indicators J_bi. Here, n(n − 1)/(n − n_1)² is a finite-sample correction term that accounts for the sub-sampling scheme. Theorem 1 of Wager and Athey (2018) implies that this variance estimator is consistent, in the sense that V_j/Var(β̂_j) →_p 1 as B → ∞.
We further propose a bias-corrected version of (5), denoted V_j^B in (6), which subtracts from V_j a Monte Carlo bias term proportional to the sample variance of the β̃_j^b's across the B splits. The derivation is similar to that in Section 4.1 of Wager et al. (2014), adapted to the sub-sampling scheme. The difference between V_j and V_j^B converges to zero as n, B → ∞, because it can be re-written as a multiple, of order n/B, of the sample variance of the β̃_j^b's from the B splits. Thus, the two variance estimators are asymptotically equivalent.
See Algorithm 2 for a summarized procedure of computing the bias-corrected variance estimates.
For finite samples, we give the order of B needed to control the Monte Carlo errors of these two variance estimators. With n_1 = qn for a fixed 0 < q < 1, the Monte Carlo bias of V_j is of order n v_j/B, whereas the bias-corrected V_j^B removes the leading term of this bias and therefore requires a smaller B. For 0 < α < 1, the asymptotic 100(1 − α)% confidence interval for β_j^*, j = 1, …, p, is given by
β̂_j ± z_{1−α/2} (V_j^B)^{1/2},
and the p-value for testing H_0: β_j^* = 0 is
2{1 − Φ(|β̂_j|/(V_j^B)^{1/2})},
where Φ is the CDF of the standard normal distribution and z_{1−α/2} = Φ^{−1}(1 − α/2).
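A hedged R sketch of Algorithm 2 and the resulting intervals is given below; it takes the per-split estimates and inclusion indicators produced by a fit such as the ssglm_fit sketch above. The raw variance follows (5) with the n(n − 1)/(n − n_1)² correction, while the exact bias-correction constant is our own adaptation of Wager et al. (2014) to sub-sampling and may differ from the paper's (6).

```r
## Sketch: infinitesimal-jackknife variance (5), a bias-corrected version, and
## Wald confidence intervals / p-values.  Inputs: beta_mat (B x p per-split
## estimates), J (B x n inclusion indicators), n1 (sub-sample size).
## NOTE: the bias-correction constant below is our reading of Wager et al. (2014)
## adapted to sub-sampling, not a formula quoted from the paper.
ssglm_infer <- function(beta_mat, J, n1, level = 0.95) {
  B <- nrow(beta_mat); n <- ncol(J)
  beta_hat <- colMeans(beta_mat)
  Jc <- sweep(J, 2, colMeans(J))                    # center J_bi over the B splits
  Bc <- sweep(beta_mat, 2, beta_hat)                # center tilde-beta_j^b over splits
  cov_ij <- crossprod(Jc, Bc) / B                   # n x p: Cov(J_bi, tilde-beta_j^b)
  V  <- n * (n - 1) / (n - n1)^2 * colSums(cov_ij^2)        # raw IJ variance, eq. (5)
  s2 <- colMeans(Bc^2)                              # sample variance of tilde-beta_j^b
  VB <- pmax(V - (n - 1) * n1 / ((n - n1) * B) * s2, 0)     # bias-corrected (assumed form)
  se <- sqrt(VB)
  z  <- qnorm(1 - (1 - level) / 2)
  data.frame(estimate = beta_hat, se = se,
             lower = beta_hat - z * se, upper = beta_hat + z * se,
             p.value = 2 * pnorm(-abs(beta_hat) / se))
}

## Usage with the earlier sketch:
## fit <- ssglm_fit(X, y, family = "binomial", B = 500)
## res <- ssglm_infer(fit$beta_mat, fit$J, fit$n1)
```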

Extension to Subvectors With Fixed Dimensions
We extend the SSGLM procedure to derive confidence regions for a subset of predictors and to test contrasts of interest. Consider β^*_{S^(1)} with |S^(1)| = p_1 ≥ 2, where p_1 is finite and does not increase with n or p. The corresponding SSGLM estimator β̂^(1) is presented in Algorithm 3, and the extension of Theorem 2 states that, under the conditions of Theorem 2,
√n I^(1) (β̂^(1) − β^*_{S^(1)}) converges in distribution to N(0, I_{p_1}),
where I_{p_1} is the p_1 × p_1 identity matrix, and I^(1) is a p_1 × p_1 positive definite matrix depending on S^(1) and S* that is defined in the proof.
The one-dimensional infinitesimal jackknife extends directly to estimate the variance-covariance matrix of β̂^(1), denoted by Σ̂^(1). To test H_0: Qβ^(1) = R, where Q is an r × p_1 contrast matrix and R is an r × 1 vector, a Wald-type test statistic can be formulated as
T = (Qβ̂^(1) − R)^T (QΣ̂^(1)Q^T)^{−1} (Qβ̂^(1) − R),   (7)
which asymptotically follows the χ²_r distribution under H_0. Therefore, with a significance level α ∈ (0, 1), we reject H_0 when T is larger than the (1 − α) × 100th percentile of χ²_r.
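A minimal R sketch of this test is shown below; beta1_hat (the estimated sub-vector) and Sigma1_hat (its estimated covariance matrix) are assumed to be available from the extended procedure, and the function name is ours.

```r
## Sketch: Wald-type test of H0: Q beta^(1) = R, given the sub-vector estimate and
## its estimated covariance matrix (assumed available from the extended procedure).
wald_contrast <- function(beta1_hat, Sigma1_hat, Q, R) {
  d    <- drop(Q %*% beta1_hat - R)                             # discrepancy, length r
  stat <- drop(t(d) %*% solve(Q %*% Sigma1_hat %*% t(Q), d))    # statistic in (7)
  c(statistic = stat, p.value = pchisq(stat, df = nrow(Q), lower.tail = FALSE))
}

## Example: joint test that two coefficients are both zero (Q = I_2, R = 0), as in
## the rs3117582 analysis of Section 7.
## wald_contrast(beta1_hat, Sigma1_hat, Q = diag(2), R = c(0, 0))
```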

Simulations
We compared the finite sample performance of the proposed SSGLM procedure, under various settings, with two existing methods, the de-biased LASSO for GLMs (Van de Geer et al., 2014; Dezeure et al., 2015) and the de-correlated score test (Ning and Liu, 2017), in terms of estimation accuracy and computational efficiency. We also investigated how the choice of the splitting proportion q = n_1/n may impact the performance of SSGLM, explored various selection methods as part of the SSGLM procedure and their impacts on estimation and inference, illustrated our method in both logistic and Poisson regression settings, and assessed the power and type I error of the procedure. We adopted some challenging simulation settings from Bühlmann et al. (2014). For example, the indices of the active set, as well as the non-zero effect sizes, were randomly generated, and various correlation structures were explored.
The MSE was, in general, less sensitive to q as q grew larger, suggesting that a large n_1 may lead to adequate accuracy. Intuitively, there is a minimum sample size n_2 = (1 − q)n required for the selection to achieve the "sure screening" property; for example, with the same tuning parameter, LASSO selects fewer variables when the sample size is smaller. On the other hand, a larger n_1 = qn directly improves the power of the low dimensional GLM estimators. Thus, the optimal split proportion makes n_1 as large as possible while keeping n_2 large enough for the sure screening property to hold. This intuition is also validated in Figure 1: efficiency is gained quickly at first, owing to better GLM estimators with larger n_1, but the gain is eventually outweighed by the bias caused by poor selections with small n_2. Our conclusion is that an optimal split proportion exists, but it depends on the specific selection method, the true model size, and other factors, rather than being fixed.
We further examined the convergence of the two variance estimators V_j and V_j^B proposed in (5) and (6) with respect to the number of splits B. Under the same setting, and with q = 0.5, we calculated both V_j and V_j^B for B = 100, 200, …, 2,000, and compared these estimates with the empirical variance of the β̂_j's (considered to be the truth) based on 200 simulation replicates. The right panel of Figure 1 plots the averages over all signals j = 1, 2, …, p and shows that V_j converges to the truth much more slowly than V_j^B, and that V_j^B has small biases even with a relatively small B.
Example 2 implemented various selection methods, LASSO, SCAD, MCP, elastic net, and Bayesian LASSO, for the variable selection step of SSGLM, and compared their impacts on estimation and inference. Ten-fold cross-validation was used to choose the tuning parameters in each selection procedure. We assumed a Poisson model with n = 300, p = 400, and s_0 = 5: for i = 1, …, n,
log E(Y_i | x_i) = β_0 + x_i β.   (8)
Example 3 also assumed model (8). We set n = 400, p = 500, and s_0 = 6, with non-zero coefficients between 0.5 and 1, and three correlation structures: identity; AR(1) with Σ_ij = ρ^{|i−j|}, ρ = 0.5; and compound symmetry (CS) with Σ_ij = ρ^{1(i≠j)}, ρ = 0.5. Table 2 shows that SSGLM consistently provided nearly unbiased estimates. The estimated standard errors (SEs) were close to the empirical standard deviations (SDs), leading to confidence intervals with coverage probabilities close to the 95% nominal level.
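A sketch of the Example 3 data-generating design is given below; the intercept value, the placement of the non-zero coefficients, and the use of MASS::mvrnorm are our illustrative choices within the stated n, p, s_0, and correlation settings.

```r
## Sketch of the Example 3 design: Gaussian predictors with AR(1) correlation and a
## sparse Poisson model; intercept and coefficient placement are illustrative.
library(MASS)

simulate_example3 <- function(n = 400, p = 500, s0 = 6, rho = 0.5) {
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))          # AR(1): Sigma_ij = rho^|i-j|
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta <- numeric(p)
  beta[sample(p, s0)] <- runif(s0, 0.5, 1)        # s0 non-zero effects in [0.5, 1]
  y <- rpois(n, lambda = exp(0.2 + X %*% beta))   # log E(Y|x) = beta_0 + x beta, eq. (8)
  list(X = X, y = y, beta = beta)
}
```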
Example 5 compared our method with the de-biased LASSO estimator (Van de Geer et al., 2014) and the de-correlated score test (Ning and Liu, 2017) in terms of power and type I error. We assumed model (9) with n = 200, p = 300, s_0 = 3, and β^*_{S*} = (2, −2, 2), with AR(1) correlation structures. Table 5 summarizes the power of detecting each true signal and the average type I error for the noise variables under the AR(1) correlation structure with four correlation values, ρ = 0.25, 0.4, 0.6, 0.75.
Our method was shown to be the most powerful, while maintaining the type I error around the nominal 0.05 level. The power was over 0.9 for the first three scenarios and was above 0.8 with ρ = 0.75. The de-biased LASSO estimators controlled the type I error well, but the power dropped from 0.9 to approximately 0.67 as the correlation among the predictors increased. The de-correlated score tests had the least power and the highest type I error. While these two competing methods have the same efficiency asymptotically, they do differ by specific implementations, for example, the choice of tuning parameters. Indeed, de-biased methods may be sensitive to tuning parameters, which could explain the gap in the finite sample performance. Table 5 summarizes the average computing time (in seconds) of the three methods per data set (R-3.6.2 on an 8-core MacBook Pro). On average, our method took 17.7 seconds, which was the fastest among the three methods. The other two methods were slower for the smaller ρ's (75 and 37 seconds, respectively) and faster for the larger ρ's (41 and 18 seconds, respectively), likely because the node-wise LASSO procedure that was used for estimating the precision matrix tended to be faster when handling more highly correlated predictors.

Data Example
We analyzed a subset of the BLCSC data (Christiani, 2017), consisting of n = 1,459 individuals, among whom 708 were lung cancer patients and 751 were controls. After cleaning, the data contained 6,829 SNPs, along with important demographic variables including age, gender, race, education level, and smoking status ( Table 6). As smoking is known to play a significant role in the development of lung cancer, we were particularly interested in estimating the interactions between the SNPs and smoking status, in addition to their main effects.
We assumed a high dimensional logistic model with the binary outcome being an indicator of lung cancer status. Predictors included demographic variables, the SNPs (with prefix "AX"), and the interactions between the SNPs and smoking status (with prefix "SAX"; p = 13,663). We applied the SSGLM with B = 1,000 random splits and drew inference on all 13,663 predictors. Table 7 lists the top predictors ranked by their p-values. We identified 9 significant coefficients after using Bonferroni correction for multiple comparisons. All were interaction terms, providing strong evidence of SNP-smoking interactions, which have rarely been reported. These nine SNPs came from three genes, TUBB, ERBB2, and TYMS. TUBB mutations are associated with both poor treatment response to paclitaxel-containing chemotherapy and poor survival in patients with advanced non-small-cell lung cancer (Wang et al., 2013).
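For readers who wish to set up a similar analysis, the predictor matrix with SNP-by-smoking interactions and the Bonferroni cutoff can be assembled as in the sketch below; the object names (demo, snp_mat, smoke, pvals) are placeholders and do not correspond to actual BLCSC variable names.

```r
## Sketch: build the predictor matrix (demographics, SNPs, SNP x smoking-status
## interactions) and flag Bonferroni-significant SSGLM p-values.
## demo (n x d), snp_mat (n x q, columns prefixed "AX"), smoke (length-n 0/1), and
## pvals (one SSGLM p-value per column of X_full) are placeholder objects.
inter_mat <- snp_mat * smoke                          # SNP x smoking interactions
colnames(inter_mat) <- sub("^AX", "SAX", colnames(snp_mat))
X_full <- cbind(demo, snp_mat, inter_mat)             # p = ncol(X_full) predictors

alpha_bonf  <- 0.05 / ncol(X_full)                    # Bonferroni threshold
significant <- colnames(X_full)[pvals < alpha_bonf]   # Bonferroni-significant predictors
```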
For comparisons, we applied the de-sparsified estimator for GLMs (Bühlmann et al., 2014). A direct application of the "lasso.proj" function in the "hdi" R package (Dezeure et al., 2015) was not feasible given the data size. Instead, we used a shorter sequence of candidate λ values and 5-fold instead of 10-fold cross-validation for the node-wise LASSO procedure. This procedure took approximately one day of CPU time. After correcting for multiple testing, there were two significant coefficients, both of which were interaction terms corresponding to SNPs AX.35719413_C and AX.83477746_A. Both SNPs were from the TUBB gene, and the first SNP was also identified by our method.
To validate our findings, we applied the prediction accuracy measures for nonlinear models proposed by Li and Wang (2019). We calculated R², the proportion of variation in Y explained, for the models we chose to compare. We report five models and their corresponding R² values: Model 1, the baseline model including only the demographic variables (R² = 0.0938); Model 2, the baseline model plus the significant interactions after the Bonferroni correction in Table 7 (R² = 0.1168); Model 3, Model 2 plus the main effects of its interaction terms (R² = 0.1181); Model 4, the baseline model plus the significant interactions from the de-sparsified LASSO method (R² = 0.1018); and Model 5, Model 4 plus the corresponding main effects (R² = 0.1076). Model 2, based on our method, explained 25% more variation in Y than the baseline model (from 0.0938 to 0.1168), while Model 4, based on the de-sparsified LASSO method, explained only 8.5% more variation (from 0.0938 to 0.1018). We also plotted receiver operating characteristic (ROC) curves for Models 1, 2, and 4 (Figure 2). Their corresponding areas under the curve (AUCs) were 0.645, 0.69, and 0.668, respectively.
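The ROC/AUC comparison can be reproduced along the lines of the sketch below, assuming fitted logistic models m1, m2, and m4 for Models 1, 2, and 4; the use of the pROC package is our choice of tool.

```r
## Sketch: AUC comparison for three fitted logistic models (m1, m2, m4 are assumed
## glm objects for Models 1, 2, and 4); pROC is one convenient package for this.
library(pROC)

aucs <- sapply(list(Model1 = m1, Model2 = m2, Model4 = m4), function(m) {
  as.numeric(auc(roc(response = m$y, predictor = fitted(m), quiet = TRUE)))
})
aucs   # areas under the ROC curves, cf. 0.645, 0.69, and 0.668 reported above
```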
Previous literature has identified several SNPs as potential risk factors for lung cancer. We studied a controversial SNP, rs3117582, from the TUBB gene on chromosome 6. This SNP was identified in association with lung cancer risk in a case-control study by Wang et al. (2008), while Wang et al. (2009) found no evidence of association between the SNP and the risk of lung cancer among never-smokers. Our goal was to test this SNP and its interaction with smoking in the presence of all the other predictors under the high dimensional logistic model. With a slight abuse of notation, we denoted the coefficients corresponding to rs3117582 and its interaction with smoking by β^(1) = (β_1, β_2) and tested H_0: β_1 = β_2 = 0. Applying the proposed method, the test statistic for the overall effect, computed by (7), was T = 0.062, with a p-value of 0.97. We concluded that, among the patients in the BLCSC, rs3117582 was not significantly associated with lung cancer, regardless of smoking status.

Conclusions
Our approach for drawing inference, by adopting a "split and smoothing" idea, improves upon Fei et al. (2019) which used bootstrap resampling, and recasts a high dimensional inference problem into a sequence of low dimensional estimations. Unlike many of the existing methods (Zhang and Zhang, 2014;Bühlmann et al., 2014;Javanmard and Montanari, 2018), our method is more computationally feasible as it does not require estimating high dimensional precision matrices. Our algorithm enables us to make full use of parallel computing for improved computational efficiency, because fitting the p low dimensional GLMs and randomly splitting the data B times are both separable tasks, which can be implemented in parallel.
We have derived the variance estimator using the infinitesimal jackknife method adapted to the splitting and smoothing procedure (Efron, 2014;Wager and Athey, 2018). This estimator is free of parametric assumptions, resembles bagging (Bühlmann and Yu, 2002), and leads to confidence intervals with correct coverage probabilities. Moreover, we have relaxed the stringent selection consistency assumption on variable selection as required in Fei et al. (2019). We have shown that our procedure works with a mild sure screening assumption for the selection method.
There are open problems to be addressed. First, our method relies on a sparsity condition for the model parameters. We envision that relaxation of the condition may take a major effort, though our preliminary simulations (Example B.2 in Appendix B) suggest that our procedure might work when the sparsity condition fails. Second, as our model is fully parametric, in-depth research is needed to develop a more robust approach when the model is mis-specified. Finally, while our procedure is feasible when p is large (tens of thousands), the computational cost increases substantially when p is extraordinarily large (millions).
Much effort is warranted to enhance its computational efficiency. Nevertheless, our work does provide a starting point for future investigations.
Appendix A: Proofs

Proof of Theorem 1. To apply Theorems 2.1 and 2.2 of He and Shao (2000) in the GLM case, we verify that our Assumptions (A1) and (A2) lead to their conditions (C1), (C2), (C4), and (C5) with C = 1, r = 2, and A(n, p_s) = p_s. To verify their (C3), we first note that their D_n is our I_S^*; then, for any β_S and any α ∈ R^{p_s} with ‖α‖_2 = 1, a second-order Taylor expansion of U_S(β_S) around β_S^* yields the required bound.

Releasing the restriction to Ω = {S ⊇ S*}: with P(Ω^c) = P(S ⊉ S*) ≤ K_2 (p ∨ n_2)^{−1−c_2}, we still have ‖β̃_S − β_S^*‖_2² = o_p(p_s/n_1), given p_s log p_s/n_1 → 0. To see this, for any ϵ > 0,
P{(n_1/p_s)^{1/2} ‖β̃_S − β_S^*‖_2 > ϵ} ≤ P{(n_1/p_s)^{1/2} ‖β̃_S − β_S^*‖_2 > ϵ | Ω} P(Ω) + P(Ω^c)
 ≤ P{(n_1/p_s)^{1/2} ‖β̃_S − β_S^*‖_2 > ϵ | Ω} + K_2 (p ∨ n_2)^{−1−c_2},
where both terms in the last bound converge to 0 as n_1 → ∞ and n_2 = (1 − q)n_1/q, with 0 < q < 1 a constant. Similarly, we can show that
‖n_1^{1/2}(β̃_S − β_S^*) + n_1^{−1/2}(I_S^*)^{−1} U_S(β_S^*)‖_2 = o_p(1), provided p_s² log p_s/n_1 → 0,
which can also be written as
β̃_S − β_S^* = −n_1^{−1}(I_S^*)^{−1} U_S(β_S^*) + r_{n_1},   (10)
with ‖r_{n_1}‖_2² = o_p(1/n_1). Consequently, taking α = (0, 1, 0, …, 0)^T and left-multiplying both sides of (10) by n_1^{1/2} α^T gives the asymptotic normality of β̃_j. ■

The following lemma, which is needed for the proof of Theorem 2, bounds the coefficient estimates when the selected subset S^b misses important predictors in S* for some 1 ≤ b ≤ B. Although S* ⊈ S^b occurs with probability going to zero by Assumption (A3), we need an upper bound in order to control the bias of β̂_j for any j.

Lemma 4 Given model (1) and Assumptions (A1) and (A2), consider the GLM estimator β̃_S with respect to a subset S as defined in (3), and denote p_s = |S|. If p_s log p_s/n → 0, then, with probability going to 1, ‖β̃_S‖_∞ ≤ C_β, where C_β > 0 is a constant depending on c_min, c_max, c_β, and A(0).
Proof. If S* ⊆ S, the result immediately follows from Theorem 1 by taking C_β = 2c_β. When S* ⊈ S, the minimizer β̃_S is no longer an unbiased estimator of β_S^*; however, its boundedness is guaranteed by the strong convexity of the objective function ℓ_S(β_S).
To see this, we note that the observed information ∇²ℓ_S(β_S) = I_S(β_S) = n^{−1} X̃_S^T V_S X̃_S has eigenvalues bounded away from zero with probability going to 1 under (A1) and (A2), so that ℓ_S(β_S) is strongly convex in β_S. Since the minimizer satisfies ℓ_S(β̃_S) ≤ ℓ_S(0) = A(0), we have β̃_S ∈ {β_S : ℓ_S(β_S) ≤ A(0)}, which, with probability going to 1, is a bounded convex set (a level set of a strongly convex function). As A(0) does not depend on S or the data, there exists a constant C_β > 0 (which depends only on A(0) and the constants in (A1) and (A2), but not on S or the data) such that ‖β̃_S‖_∞ ≤ C_β holds with probability going to 1. ■
With the oracle estimators β̃_j^{*b}, obtained on each D_1^b by using the true active set S* in place of S^b, and the full-data oracle estimator β̃_j^*, based on all n samples and S*, we have the following decomposition:
√n (β̂_j − β_j^*) = √n (β̃_j^* − β_j^*) + √n (B^{−1} Σ_{b=1}^B β̃_j^{*b} − β̃_j^*) + √n B^{−1} Σ_{b=1}^B (β̃_j^b − β̃_j^{*b}) ≡ I + II + III.   (12)
The first two terms in (12), which do not involve the various selections S^b, deal with the oracle estimators and the true active set S*. We need to show the following, which will lead to the results stated in the theorem by Slutsky's theorem.

a. I/σ_j = √n (β̃_j^* − β_j^*)/σ_j →_d N(0, 1);
b. II = o_p(1);
c. III = o_p(1).
To prove (c), i.e., III = o_p(1), we first note that each subsample D_1^b can be regarded as a random sample of n_1 = qn (0 < q < 1) i.i.d. observations from the population distribution, for which Assumption (A3) holds; that is, |S^b| ≤ K_1 n^{c_1} and P(S* ⊆ S^b) ≥ 1 − K_2 (p ∨ n)^{−1−c_2}. We first show that for any b, conditional on S^b ⊇ S*, √n (β̃_j^b − β̃_j^{*b}) = o_p(1).
To see this, we first arrange the components of x = (x_1, …, x_p) so that the first s_0 components are the signal variables; in other words, S* = {1, …, s_0}. From (10) in the proof of Theorem 1, and omitting the superscript b, we have
β̃_j − β_j^* = −n_1^{−1} e_j^T (I_{S_{+j}}^*)^{−1} U_{S_{+j}}(β_{S_{+j}}^*) + r_{n_1},
β̃_j^* − β_j^* = −n_1^{−1} ẽ_j^T (I_{S*_{+j}}^*)^{−1} U_{S*_{+j}}(β_{S*_{+j}}^*) + r̃_{n_1},   (13)
where e_j is a unit vector of length |S_{+j}| indexing the position of variable j in S_{+j}, ẽ_j is a unit vector of length |S*_{+j}| indexing the position of variable j in S*_{+j}, and the residuals satisfy ‖r_{n_1}‖_2² = o_p(1/n_1) and ‖r̃_{n_1}‖_2² = o_p(1/n_1). Here, I_{S_{+j}}^* and I_{S*_{+j}}^* are two sub-matrices of the expected information at β*, i.e., I* = E{n^{−1} X̃^T V X̃} with V = diag{ν(μ_i)}. Therefore, the jk-th (j, k = 0, 1, …, p) entry of I*, a (p + 1) × (p + 1) matrix, is E{x_j ν(μ) x_k} with μ = g^{−1}(x̃β*). Now, for any j ∈ S* and k ∈ S*^c, the complement of S*, the partial orthogonality condition (Fan and Lv, 2008; Fan and Song, 2010), namely that {x_j, j ∈ S*} are independent of {x_k, k ∈ S*^c}, implies that E{x_j ν(μ) x_k} = 0, because μ depends only on x_{j′}, j′ ∈ S*, and E(x_k) = 0 for centered predictors. Therefore, I* is block-diagonal with two blocks indexed by S* and S*^c, whose sub-matrices correspond to X_{S*} and X_{S*^c} as defined in Section 2.1. Hence, I_{S_{+j}}^* is block-diagonal with two blocks indexed by S* and S_{+j}\S*, and I_{S*_{+j}}^* is block-diagonal with two blocks indexed by S* and S*_{+j}\S*, which is empty if j ∈ S* and equals {j} otherwise.
Write U(β*) = (u_0, u_1, u_2, …, u_p)^T, e_j^T (I_{S_{+j}}^*)^{−1} = (i_{jk})_{k∈S_{+j}}, and ẽ_j^T (I_{S*_{+j}}^*)^{−1} = (ĩ_{jk})_{k∈S*_{+j}}. It then follows that i_{jk} = ĩ_{jk} for k ∈ S*, which leads to
√n_1 (β̃_j − β̃_j^*) = −n_1^{−1/2} Σ_{k∈S_{+j}} i_{jk} u_k + n_1^{−1/2} Σ_{k∈S*_{+j}} ĩ_{jk} u_k + r′_{n_1}
 = −n_1^{−1/2} Σ_{k∈S\S*} i_{jk} u_k + n_1^{−1/2} 1(j ∉ S*)(ĩ_{jj} − i_{jj}) u_j + r′_{n_1},
where r′_{n_1} = n_1^{1/2}(r_{n_1} − r̃_{n_1}) = o_p(1), and r_{n_1} and r̃_{n_1} are as in (13).
With Assumption (A3), |S\S*| ≤ K_1 n^{c_1} = o(n_1^{1/2}) with 0 ≤ c_1 < 1/2 and, thus, Var{n_1^{−1/2} Σ_{k∈S\S*} i_{jk} u_k} = o(1). By the Chebyshev inequality, the first term on the right-hand side converges to 0 in probability. Thus, each of these three terms is o_p(1) and we have √n_1 (β̃_j − β̃_j^*) = o_p(1). As n_1/n = q with 0 < q < 1, the original statement follows.
When S^b ⊉ S*, the estimate β̃_j^b, which we denote by η^b (omitting the subscript j in η for simplicity), is no longer an unbiased estimator of β_j^*; instead, we bound it by a constant. By Lemma 4, there exists C_β ≥ c_β such that |η^b| ≤ C_β with probability going to 1, which we refer to as (14). With dependent η^b's, we further have Var(B^{−1} Σ_b η^b) ≤ B^{−2} Σ_{b,b′} |Cov(η^b, η^{b′})| ≤ max_b Var(η^b), where the last inequality holds because |Cov(η^b, η^{b′})| ≤ {Var(η^b) Var(η^{b′})}^{1/2} and by (14). Then we show III = o_p(1). More specifically, for any δ > 0 and ζ > 0, choose N_0 (depending on C_β, K_2, c_2, and δ) such that |E(III)| ≤ δ/2 whenever n > N_0, and choose N_1 (depending additionally on ζ) such that Var(III) ≤ ζδ²/4 whenever n > N_1. Then, when n > max(N_0, N_1), we have
P(|III| > δ) ≤ P(|III − E(III)| > δ/2) ≤ 4Var(III)/δ² ≤ ζ,
where the first inequality is due to |E(III)| ≤ δ/2 when n > N_0, the second is due to the Chebyshev inequality, and the last is due to n > N_1. ■
Here, for a square matrix Q, (Q)_S denotes the sub-matrix of Q with rows and columns indexed by S. Denote I^(1) = {(I_{S^(1) ∪ S*}^*)^{−1/2}}_{S^(1)}.
Similar to (12), we have the decomposition
√n I^(1) (β̂^(1) − β^*_{S^(1)}) = √n I^(1) (β̃^(1) − β^*_{S^(1)}) + √n I^(1) B^{−1} Σ_{b=1}^B (β̃^{(1),b} − β̃^{(1),*b}),
where β̃^(1) and β̃^{(1),*b} denote the oracle counterparts of the sub-vector estimates, as in the previous proof. Analogous to the derivations there, it follows that the second term is o_p(1).

Appendix B: Additional Simulations
To assess the robustness of our method, we performed additional simulations when the parametric model was mis-specified and when the sparsity condition was violated.
Example B.1 assumed that Y | x followed a negative binomial distribution:
P(Y = y) = {Γ(y + r)/(Γ(r) y!)} π^r (1 − π)^y,  E(Y) = μ = r(1 − π)/π = x̃β*,
with r = 10, sample size n = 300, p = 500, and s_0 = 5. However, we modeled the data using SSGLM under the Poisson regression (8) with B = 300. Table 8 summarizes the results based on 200 simulated data sets. The β̂_j's had small biases. The estimated standard errors were slightly smaller than the empirical standard deviations. Nevertheless, the coverage probabilities were still close to the 0.95 nominal level.
Example B.2 assumed a non-sparse truth β* under the Poisson model (8). With n = 300 and p = 500, we let s_0 = 100. Among the 100 predictors with non-zero effects, 96 of the β_j's were small, randomly drawn from Unif[−0.5, 0.5], and the other 4 had values −1.5, −1, 1, and 1.5 (as shown in Figure 3). With many small but non-zero signals, SSGLM still gave nearly unbiased estimates for all of them. See Table 9, where the columns represent the 4 large β_j's and the averages over all small signals and all noise variables, respectively.

Figure and table captions:
SSGLM under the non-sparse truth, with p = 500 and s_0 = 100.
ROC curves of the three selected models (Figure 2).
Comparisons of different selection procedures used to implement our proposed method. The first column gives the indices of the non-zero signals. The last row for selection frequency is the average number of predictors selected by each procedure, and the last row for coverage probability is the average coverage probability over all predictors.
SSGLM under the logistic regression, with estimation and inference for the sub-vector β^(1) = β_{S*}. We compare the SSGLM (left) to the oracle model (right), where the oracle estimator is from the low dimensional GLM given the true set S*, and the empirical covariance matrix is with respect to the simulation replications.
SSGLM under logistic regression, with rejection rates for testing the contrasts. When the truth is 0, the rejection rates estimate the type I error probability; when the truth is nonzero, they estimate the testing power.
Comparisons of SSGLM, Lasso-pro, and the de-correlated score test (Dscore) in power, type I error, and computing time (Table 5). AR(1) correlation structures with different ρ's for X are assumed.